Trees are important to our environment: they provide homes and food for many different organisms, and they take up carbon dioxide and release oxygen into our ecosystem. Many factors influence how trees grow, but sunlight, water, and nutrients are essential.
This project analyzes trees planted on the streets of Metro Vancouver and examines their growth in relation to the categorical variables provided.
import pandas as pd
import altair as alt
trees = pd.read_csv('small_unique_vancouver.csv', parse_dates = ['date_planted'])
trees.head()
| | Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | assigned | ... | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10747 | W 20TH AV | W 20TH AV | PLATANOIDES | Riley Park | 2000-02-23 | 28.5 | EVEN | ACER | N | ... | 15 | Y | 21421 | NORWAY MAPLE | 4 | 0 | NaN | N | 49.252711 | -123.106323 |
| 1 | 12573 | W 18TH AV | W 18TH AV | CALLERYANA | Arbutus-Ridge | 1992-02-04 | 6.0 | ODD | PYRUS | N | ... | 7 | Y | 129645 | CHANTICLEER PEAR | 2 | 2300 | CHANTICLEER | N | 49.256350 | -123.158709 |
| 2 | 29676 | ROSS ST | ROSS ST | NIGRA | Sunset | NaT | 12.0 | ODD | PINUS | N | ... | 7 | Y | 154675 | AUSTRIAN PINE | 4 | 7800 | NaN | N | 49.213486 | -123.083254 |
| 3 | 8856 | DOMAN ST | DOMAN ST | AMERICANA | Killarney | 1999-11-12 | 11.0 | EVEN | FRAXINUS | N | ... | 7 | Y | 180803 | AUTUMN APPLAUSE ASH | 4 | 6900 | AUTUMN APPLAUSE | N | 49.220839 | -123.036721 |
| 4 | 21098 | EAST BOULEVARD | EAST BOULEVARD | HIPPOCASTANUM | Shaughnessy | NaT | 15.5 | ODD | AESCULUS | Y | ... | N | Y | 74364 | COMMON HORSECHESTNUT | 4 | 5200 | NaN | N | 49.238514 | -123.154958 |
5 rows × 21 columns
The first column is only a row identifier and is not useful for the analysis, so it will be dropped.
trees.drop(columns = trees.columns[0], inplace=True)
trees.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 20 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | std_street | 5000 non-null | object |
| 1 | on_street | 5000 non-null | object |
| 2 | species_name | 5000 non-null | object |
| 3 | neighbourhood_name | 5000 non-null | object |
| 4 | date_planted | 2363 non-null | datetime64[ns] |
| 5 | diameter | 5000 non-null | float64 |
| 6 | street_side_name | 5000 non-null | object |
| 7 | genus_name | 5000 non-null | object |
| 8 | assigned | 5000 non-null | object |
| 9 | civic_number | 5000 non-null | int64 |
| 10 | plant_area | 4950 non-null | object |
| 11 | curb | 5000 non-null | object |
| 12 | tree_id | 5000 non-null | int64 |
| 13 | common_name | 5000 non-null | object |
| 14 | height_range_id | 5000 non-null | int64 |
| 15 | on_street_block | 5000 non-null | int64 |
| 16 | cultivar_name | 2658 non-null | object |
| 17 | root_barrier | 5000 non-null | object |
| 18 | latitude | 5000 non-null | float64 |
| 19 | longitude | 5000 non-null | float64 |
dtypes: datetime64[ns](1), float64(3), int64(4), object(12)
memory usage: 781.4+ KB
After dropping the identifier, the tree dataset has 5000 observations and 20 columns. All columns except date_planted, plant_area, and cultivar_name are complete: date_planted is missing more than half of its values (2637), plant_area is missing 50, and cultivar_name is missing just under half (2342). The data has 12 object (string) columns, 3 float columns, 4 integer columns, and 1 datetime column.
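The missing-value counts above were read off the info() output. A minimal sketch of how the same summary could be computed directly follows; the `sample` frame and its values are made up for illustration and only stand in for the real `trees` table.

```python
import pandas as pd

# Hypothetical miniature of the trees table; values are invented for illustration.
sample = pd.DataFrame({
    "date_planted": pd.to_datetime(["2000-02-23", None, "1992-02-04", None]),
    "plant_area": ["15", "7", None, "N"],
    "diameter": [28.5, 6.0, 12.0, 11.0],
})

# Count missing values per column and express them as a share of all rows.
missing = sample.isna().sum()
missing_share = missing / len(sample)

# Show only the columns that actually have gaps.
summary = pd.DataFrame({"missing": missing, "share": missing_share})
print(summary[summary["missing"] > 0])
```

Applied to the real data, replacing `sample` with `trees` should list only date_planted, plant_area, and cultivar_name.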
trees.describe(include = 'all', datetime_is_numeric=True)
| | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | assigned | civic_number | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000 | 5000 | 5000 | 5000 | 2363 | 5000.000000 | 5000 | 5000 | 5000 | 5000.000000 | 4950 | 5000 | 5000.000000 | 5000 | 5000.00000 | 5000.000000 | 2658 | 5000 | 5000.000000 | 5000.000000 |
| unique | 603 | 607 | 171 | 22 | NaN | NaN | 4 | 67 | 2 | NaN | 38 | 2 | NaN | 361 | NaN | NaN | 176 | 2 | NaN | NaN |
| top | W 13TH AV | CAMBIE ST | SERRULATA | Renfrew-Collingwood | NaN | NaN | ODD | ACER | N | NaN | 10 | Y | NaN | KWANZAN FLOWERING CHERRY | NaN | NaN | KWANZAN | N | NaN | NaN |
| freq | 52 | 49 | 463 | 384 | NaN | NaN | 2554 | 1218 | 4564 | NaN | 736 | 4593 | NaN | 383 | NaN | NaN | 383 | 4679 | NaN | NaN |
| mean | NaN | NaN | NaN | NaN | 2003-09-06 04:03:08.912399488 | 12.340888 | NaN | NaN | NaN | 2975.707600 | NaN | NaN | 128682.584600 | NaN | 2.73440 | 2960.227000 | NaN | NaN | 49.247349 | -123.107128 |
| min | NaN | NaN | NaN | NaN | 1989-10-31 00:00:00 | 0.000000 | NaN | NaN | NaN | 2.000000 | NaN | NaN | 36.000000 | NaN | 0.00000 | 0.000000 | NaN | NaN | 49.202783 | -123.220560 |
| 25% | NaN | NaN | NaN | NaN | 1997-11-06 00:00:00 | 4.000000 | NaN | NaN | NaN | 1300.500000 | NaN | NaN | 61321.500000 | NaN | 2.00000 | 1300.000000 | NaN | NaN | 49.230152 | -123.144178 |
| 50% | NaN | NaN | NaN | NaN | 2003-02-12 00:00:00 | 10.000000 | NaN | NaN | NaN | 2639.000000 | NaN | NaN | 130130.500000 | NaN | 2.00000 | 2600.000000 | NaN | NaN | 49.247981 | -123.105861 |
| 75% | NaN | NaN | NaN | NaN | 2009-11-17 00:00:00 | 18.000000 | NaN | NaN | NaN | 4123.000000 | NaN | NaN | 191332.000000 | NaN | 4.00000 | 4100.000000 | NaN | NaN | 49.263275 | -123.063484 |
| max | NaN | NaN | NaN | NaN | 2019-05-07 00:00:00 | 71.000000 | NaN | NaN | NaN | 9113.000000 | NaN | NaN | 270750.000000 | NaN | 9.00000 | 9100.000000 | NaN | NaN | 49.293930 | -123.023311 |
| std | NaN | NaN | NaN | NaN | NaN | 9.266600 | NaN | NaN | NaN | 2078.580429 | NaN | NaN | 75412.260406 | NaN | 1.56957 | 2086.861052 | NaN | NaN | 0.021251 | 0.049137 |
Based on the questions of interest, only the columns neighbourhood_name, date_planted, diameter, genus_name, height_range_id, and root_barrier will be used. The other columns will be dropped. The column date_planted is formatted using YYYY-MM-DD. As only the year will be useful, it will be extracted and converted to integer for easier access.
trees['year_planted'] = trees['date_planted'].dt.year.astype('Int64')
trees = trees[['neighbourhood_name', 'year_planted', 'diameter', 'genus_name', 'height_range_id', 'root_barrier']]
trees.head()
| | neighbourhood_name | year_planted | diameter | genus_name | height_range_id | root_barrier |
|---|---|---|---|---|---|---|
| 0 | Riley Park | 2000 | 28.5 | ACER | 4 | N |
| 1 | Arbutus-Ridge | 1992 | 6.0 | PYRUS | 2 | N |
| 2 | Sunset | <NA> | 12.0 | PINUS | 4 | N |
| 3 | Killarney | 1999 | 11.0 | FRAXINUS | 4 | N |
| 4 | Shaughnessy | <NA> | 15.5 | AESCULUS | 4 | N |
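The year extraction above casts to the nullable 'Int64' dtype, which is why missing planting dates show up as <NA> rather than forcing the column to floats. A small sketch of that behaviour, using a made-up `dates` series rather than the real column:

```python
import pandas as pd

# Made-up planting dates, one of them missing (NaT).
dates = pd.Series(pd.to_datetime(["2000-02-23", None, "1999-11-12"]))

# With a missing date present, .dt.year cannot return plain integers,
# so the missing entry comes back as NaN in a float column.
years_float = dates.dt.year

# The nullable 'Int64' dtype keeps whole-number years and marks the gap as <NA>.
years_int = years_float.astype("Int64")

print(years_float.tolist())
print(years_int.tolist())
```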
Before diving into the questions, let's visualize how many trees are present in each neighbourhood. A neighbourhood with more trees may be one that has more resources available for trees to grow.
alt.Chart(trees).mark_bar().encode(
alt.X('neighbourhood_name', sort = '-y', title = "Neighbourhood"),
alt.Y('count()')
).properties(title = "Number of trees present in each neighbourhood")
According to the bar graph of the number of trees in each neighbourhood, the neighbourhoods with the most trees planted are Renfrew-Collingwood and Kensington-Cedar Cottage, each with over 350 trees. The neighbourhood with the fewest trees is Strathcona, with fewer than 100.
Trees gain height and length through primary growth, while their thickness is determined by secondary growth. Since height reflects primary growth, I will focus only on the column height_range_id.
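The counts behind the bar chart can also be checked numerically. The sketch below assumes nothing beyond pandas and uses a made-up `sample` frame in place of `trees`:

```python
import pandas as pd

# Invented neighbourhood labels standing in for trees['neighbourhood_name'].
sample = pd.DataFrame({
    "neighbourhood_name": ["Sunset", "Killarney", "Sunset",
                           "Riley Park", "Sunset", "Killarney"],
})

# value_counts() sorts from most to least frequent by default.
counts = sample["neighbourhood_name"].value_counts()

print(counts.head(2))  # neighbourhoods with the most trees
print(counts.tail(1))  # neighbourhood with the fewest trees
```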
Let's take a look at the height of the trees in each neighbourhood for each year planted.
alt.Chart(trees, width = 190, height = 80).mark_bar().encode(
alt.X('neighbourhood_name', title = "Neighbourhood", sort = '-y'),
alt.Y('mean(height_range_id)', title = "Mean of height")
).facet('year_planted', columns = 4, title = "Mean height range of trees in each neighbourhood for different years").resolve_scale(y = 'independent', x = 'independent')
From the faceted graph, trees' growth and their neighbourhood do not appear to be correlated with one another.
If more trees are planted in certain areas than others, they would need to share and compete for resources, which would slow down their growth. Let's compare the graphs of the average height of trees for each year and the total number of trees planted each year.
height_mean_chart = alt.Chart(trees).mark_circle().encode(
alt.X('year_planted', title = "Year", scale = alt.Scale(domain = [1988, 2020])),
alt.Y('mean(height_range_id)', title = "Height range mean")
).properties(title = "Average tree height range for each year planted")
height_mean_chart
tree_yearly_chart = alt.Chart(trees).mark_bar().encode(
alt.Y('count()'),
alt.X('year_planted', title = "Year", scale = alt.Scale(domain = [1988, 2020]))
).properties(title = "Total number of trees planted")
tree_yearly_chart
Some correlation appears to be present between the average height of trees and the total number of trees planted each year. In years where fewer trees were planted than in the few years before or after (1991, 2003), the average height of the trees appears to be greater.
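The visual comparison of the two charts could be backed by a number. The following is a sketch only: it aggregates a made-up `sample` frame per planting year and computes a Pearson correlation between yearly tree count and yearly mean height range. The values are invented, so the result says nothing about the real data.

```python
import pandas as pd

# Invented data: the year with few trees (1991) has a higher height range.
sample = pd.DataFrame({
    "year_planted": [1990, 1990, 1990, 1991, 1992, 1992, 1993, 1993, 1993, 1993],
    "height_range_id": [2, 3, 1, 5, 3, 3, 2, 1, 2, 3],
})

# One row per year: how many trees were planted and their mean height range.
yearly = sample.groupby("year_planted").agg(
    n_trees=("height_range_id", "count"),
    mean_height=("height_range_id", "mean"),
)

# Pearson correlation between the two yearly series;
# a negative value means fewer trees planted goes with taller trees.
r = yearly["n_trees"].corr(yearly["mean_height"])
print(yearly)
print(round(r, 3))
```

With only a few dozen planting years, such a coefficient would be a rough indication at best.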
Let's explore how many trees within the dataset have root barriers installed.
barrier_chart = alt.Chart(trees).mark_bar().encode(
alt.Y('root_barrier', title = "Root barrier"),
alt.X('count()')
).properties(title = "Number of trees with and without root barriers")
barrier_chart
The bar graph shows that fewer than 10% of the trees in the dataset have a root barrier installed (root_barrier is N for 4679 of the 5000 trees). Let's now compare the mean height of trees with and without a root barrier.
barrier_height_chart = alt.Chart(trees).mark_bar().encode(
alt.X('mean(height_range_id)', title = "Average tree height range"),
alt.Y('root_barrier', title = "Root barrier")
).properties(title = "Average tree height range by root barrier")
barrier_height_chart
Trees without root barriers appear to have a higher average height range than trees with root barriers: their mean height_range_id is roughly twice that of trees with root barriers.
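Both root-barrier charts summarize numbers that can be computed directly. A minimal sketch, assuming a made-up `sample` frame with the same two columns as `trees`:

```python
import pandas as pd

# Invented data: 'Y' means a root barrier is installed, 'N' means it is not.
sample = pd.DataFrame({
    "root_barrier": ["N", "N", "N", "Y", "N", "Y", "N", "N"],
    "height_range_id": [4, 2, 5, 1, 3, 2, 4, 3],
})

# Share of trees with and without a root barrier.
share = sample["root_barrier"].value_counts(normalize=True)

# Group size and mean height range for each group.
by_barrier = sample.groupby("root_barrier")["height_range_id"].agg(["count", "mean"])

# How many times larger the mean height range is without a barrier.
ratio = by_barrier.loc["N", "mean"] / by_barrier.loc["Y", "mean"]

print(share)
print(by_barrier)
print(round(ratio, 2))
```

Reporting the group sizes alongside the means matters here, because the two groups differ greatly in size.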
alt.Chart(trees, width = 80, height = 450).mark_bar().encode(
alt.Y('genus_name', sort = '-x', title = "Genus"),
alt.X('mean(height_range_id)', title = "Average height")
).facet('year_planted', columns = 4, title = "Average height of each genus of tree by year planted").resolve_scale(y = 'independent', x = 'independent')
From the above graph, some genera of trees appear to grow taller than others. The most obvious is the genus Platanus: in the years in which it appears, it sits within the top rows.
The categorical variables of interest, neighbourhood_name and genus_name, have many unique values (22 and 67, respectively), which makes the charts hard to read. It would be best to filter the data so that each column has only around 10 unique values. Similarly, year_planted should be restricted to 1992 to 2014, since most trees were planted within those years.
The four graphs that will be useful in the final project are the first, second, sixth, and seventh, as they best fit the questions of interest.
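One way the proposed filtering could look is sketched below. The helper `keep_top_categories` is hypothetical (it is not part of the original notebook), and keeping the most frequent values is only one possible way to reduce a column to around 10 categories. The `sample` frame is made up.

```python
import pandas as pd

def keep_top_categories(df, column, n=10):
    """Hypothetical helper: keep rows whose value in `column` is among the n most frequent."""
    top = df[column].value_counts().head(n).index
    return df[df[column].isin(top)]

# Invented data standing in for the trees table.
sample = pd.DataFrame({
    "genus_name": ["ACER", "ACER", "PRUNUS", "PINUS", "ACER", "PRUNUS"],
    "year_planted": pd.array([1990, 1995, 2000, 2016, None, 2010], dtype="Int64"),
})

# Keep the 2 most common genera here; the text suggests around 10 for the real data.
filtered = keep_top_categories(sample, "genus_name", n=2)

# Keep planting years 1992 to 2014 inclusive; rows with a missing year are dropped.
in_range = filtered["year_planted"].between(1992, 2014).fillna(False)
filtered = filtered[in_range]

print(filtered)
```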